The Interaction Loop
In the theater of Reinforcement Learning, the Agent and the Environment perform a continuous dance. At each discrete time step $t$, the agent receives a representation of the environment's State ($S_t$). Based on this, the agent selects an Action ($A_t$). One step later, as a consequence of its action, the agent receives a numerical Reward ($R_{t+1}$) and finds itself in a new state ($S_{t+1}$).
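This loop can be sketched in a few lines. The corridor environment below is invented for illustration (its states, actions, and rewards are not from the text); it only serves to show the $S_t \to A_t \to R_{t+1}, S_{t+1}$ cycle:

```python
import random

# Hypothetical toy environment: a 1-D corridor with states 0..4.
# State 4 is the goal; actions are -1 (step left) and +1 (step right).
def step(state, action):
    """One environment transition: returns (S_{t+1}, R_{t+1})."""
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward

def random_policy(state):
    """A placeholder agent that ignores the state entirely."""
    return random.choice([-1, +1])

# One turn of the dance: observe S_t, select A_t, receive R_{t+1} and S_{t+1}.
s = 0
a = random_policy(s)
s_next, r = step(s, a)
```

Note the off-by-one convention: the reward produced by action $A_t$ is indexed $R_{t+1}$, emphasizing that it arrives together with the next state.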
The Finite MDP Framework
A Finite Markov Decision Process (Finite MDP) is the mathematical bedrock of this interaction. It assumes that the sets of states, actions, and rewards are finite. This allows us to define the dynamics of the environment through a single probability distribution: $p(s', r | s, a) = \Pr\{S_t=s', R_t=r | S_{t-1}=s, A_{t-1}=a\}$.
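Because everything is finite, the dynamics function is just a lookup table. The sketch below uses an invented two-state MDP (states, actions, and rewards are illustrative assumptions, not from the text) to show two consequences of the definition: the probabilities must sum to 1 for each $(s, a)$, and the state-transition probabilities $p(s' \mid s, a)$ follow by marginalizing out the reward:

```python
# Four-argument dynamics p(s', r | s, a) of a tiny, invented finite MDP,
# stored as a table: (s, a) -> {(s', r): probability}.
p = {
    ('s0', 'a0'): {('s0', 0.0): 0.5, ('s1', 1.0): 0.5},
    ('s0', 'a1'): {('s1', 0.0): 1.0},
    ('s1', 'a0'): {('s0', 0.0): 1.0},
    ('s1', 'a1'): {('s1', 1.0): 1.0},
}

# p is a probability distribution over (s', r) for every state-action pair.
for (s, a), outcomes in p.items():
    assert abs(sum(outcomes.values()) - 1.0) < 1e-9

def transition_prob(s, a, s_next):
    """p(s' | s, a) = sum over r of p(s', r | s, a)."""
    return sum(prob for (sp, r), prob in p[(s, a)].items() if sp == s_next)
```

Other quantities, such as the expected reward $r(s, a)$, can be derived from the same table in the same way.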
- Agent-Environment Boundary: This is not a physical shell but a functional one. Anything the agent cannot change arbitrarily is considered part of the environment. A robot's motors and sensors belong to the environment; the control software that selects actions is the agent.
- Episodic Tasks: The interaction breaks into finite sequences called Episodes, ending in a Terminal State (e.g., a game over).
- Continuing Tasks: Interactions that go on forever without a natural end (e.g., process control).
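The episodic case can be sketched as a loop that runs until a terminal state is reached. The corridor environment here is again an invented example (states 0 through 4, with state 4 terminal), not something defined in the text:

```python
TERMINAL = 4  # assumed terminal state of a hypothetical 1-D corridor

def step(state, action):
    """One transition: actions are -1 (left) and +1 (right)."""
    next_state = max(0, min(TERMINAL, state + action))
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward

def run_episode(policy, start=0, max_steps=1000):
    """Collect rewards R_1, R_2, ... until the terminal state ends the episode."""
    state, rewards = start, []
    for _ in range(max_steps):
        state, reward = step(state, policy(state))
        rewards.append(reward)
        if state == TERMINAL:
            break
    return rewards

# A deterministic "always go right" policy terminates after four steps.
rewards = run_episode(lambda s: +1)
```

A continuing task has no such break condition; in that setting the loop runs indefinitely, which is why later machinery (e.g., discounting) is needed to keep cumulative reward well defined.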